NSF PAR Search | NSF Public Access Repository

Homoploid hybridization adds clarity to the origins of octoploid strawberries

https://doi.org/10.1073/pnas.2502814122

Fan, Zhen; Liston, Aaron; Soltis, Douglas E; Soltis, Pamela S; Ashman, Tia-Lynn; Hummer, Kim E; Whitaker, Vance M (June 2025, Proceedings of the National Academy of Sciences)

The evolutionary histories of many polyploid plant species are difficult to resolve due to a complex interplay of hybridization, incomplete lineage sorting, and missing diploid progenitors. In the case of octoploid strawberry with four subgenomes designated ABCD, the identities of the diploid progenitors for subgenomes C and D have been subject to much debate. By integrating new sequencing data from North American diploids with reticulate phylogeny and admixture analyses, we uncovered introgression from an extinct or unsampled species in the clade of Fragaria viridis, Fragaria nipponica, and Fragaria nilgerrensis into the donor of subgenome A of octoploid Fragaria prior to its divergence from F. vesca subsp. bracteata. We also detected an introgression event from F. iinumae into an ancestor of F. nipponica and F. nilgerrensis. Using an LTR-age-distribution-based approach, we estimate that the octoploid and its intermediate hexaploid and tetraploid ancestors emerged approximately 0.8, 2, and 3 million years ago, respectively. These results provide an explanation for previous reports of F. viridis and F. nipponica as donors of the C and D subgenomes and suggest a greater role than previously thought for homoploid hybridization in the diploid progenitors of octoploid strawberry. The integrated set of approaches used here can help advance polyploid genome analysis in other species where hybridization and incomplete lineage sorting obscure evolutionary relationships.
Full Text Available
Homoploid Hybridization Resolves the Origin of Octoploid Strawberries

https://doi.org/10.1101/2024.09.12.612680

Fan, Zhen; Liston, Aaron; Soltis, Douglas; Soltis, Pamela; Ashman, Tia-Lynn; Hummer, Kim; Whitaker, Vance M (September 2024, bioRxiv)

The identity of the diploid progenitors of octoploid cultivated strawberry (Fragaria × ananassa) has been subject to much debate. Past work identified four subgenomes and consistent evidence for F. californica (previously named F. vesca subsp. bracteata) and F. iinumae as donors for subgenomes A and B, respectively, with conflicting results for the origins of subgenomes C and D. Here, reticulate phylogeny and admixture analysis support hybridization between F. viridis and F. vesca in the ancestry of subgenome A, and between F. nipponica and F. iinumae in the ancestry of subgenome B. Using an LTR-age-distribution-based approach, we estimate that the octoploid and its intermediate hexaploid and tetraploid ancestors emerged approximately 0.8, 2, and 3 million years ago, respectively. These results provide an explanation for previous reports of F. viridis and F. nipponica as donors of the C and D subgenomes and unify conflicting hypotheses about the evolutionary origin of octoploid Fragaria.
Full Text Available
CSurF: Sparse Lexical Retrieval through Contextualized Surface Forms

https://doi.org/10.1145/3578337.3605126

Fan, Zhen; Gao, Luyu; Callan, Jamie (August 2023, ACM)

Lexical exact-match systems perform text retrieval efficiently with sparse matching signals and fast retrieval through inverted lists, but naturally suffer from the mismatch between lexical surface form and implicit term semantics. This paper proposes to directly bridge the surface form space and the term semantics space in lexical exact-match retrieval via contextualized surface forms (CSF). Each CSF pairs a lexical surface form with a context source, and is represented by a lexical form weight and a contextualized semantic vector representation. This framework is able to perform sparse lexicon-based retrieval by learning to represent each query and document as a "bag-of-CSFs", simultaneously addressing two key factors in sparse retrieval: vocabulary expansion of surface form and semantic representation of term meaning. At retrieval time, it efficiently matches CSFs through exact-match of learned surface forms, and effectively scores each CSF pair via contextual semantic representations, leading to joint improvement in both term match and term scoring. Multiple experiments show that this approach successfully resolves the main mismatch issues in lexical exact-match retrieval and outperforms state-of-the-art lexical exact-match systems, reaching accuracy comparable to that of lexical all-to-all soft match systems while remaining an efficient exact-match-based system.
COILcr: Efficient semantic matching in contextualized exact match retrieval

Fan, Zhen; Gao, Luyu; Jha, Rohan; Callan, Jamie (April 2023, Advances in Information Retrieval – 44th European Conference on IR Research)

Lexical exact match systems that use inverted lists are a fundamental text retrieval architecture. A recent advance in neural IR, COIL, extends this approach with contextualized inverted lists from a deep language model backbone and performs retrieval by comparing contextualized query-document term representation, which is effective but computationally expensive. This paper explores the effectiveness-efficiency tradeoff in COIL-style systems, aiming to reduce the computational complexity of retrieval while preserving term semantics. It proposes COILcr, which explicitly factorizes COIL into intra-context term importance weights and cross-context semantic representations. At indexing time, COILcr further maps term semantic representations to a smaller set of canonical representations. Experiments demonstrate that canonical representations can efficiently preserve term semantics, reducing the storage and computational cost of COIL-based retrieval while maintaining model performance. The paper also discusses and compares multiple heuristics for canonical representation selection and looks into its performance in different retrieval settings.
Full Text Available
Complement lexical retrieval model with semantic residual embeddings

Gao, Luyu; Dai, Zhuyun; Chen, Tongfei; Fan, Zhen; Van Durme, Benjamin; Callan, Jamie (March 2021, Advances in Information Retrieval – 43rd European Conference on IR Research)
This paper presents CLEAR, a retrieval model that seeks to complement classical lexical exact-match models such as BM25 with semantic matching signals from a neural embedding matching model. CLEAR explicitly trains the neural embedding to encode language structures and semantics that lexical retrieval fails to capture with a novel residual-based embedding learning method. Empirical evaluations demonstrate the advantages of CLEAR over state-of-the-art retrieval models, and that it can substantially improve the end-to-end accuracy and efficiency of reranking pipelines.
Full Text Available
Incorporating Multimodal Information in Open-Domain Web Keyphrase Extraction

https://doi.org/10.18653/v1/2020.emnlp-main.140

Wang, Yansen; Fan, Zhen; Rose, Carolyn (January 2020, Incorporating Multimodal Information in Open-Domain Web Keyphrase Extraction)
Open-domain Keyphrase extraction (KPE) on the Web is a fundamental yet complex NLP task with a wide range of practical applications within the field of Information Retrieval. In contrast to other document types, web page designs are intended for easy navigation and information finding. Effective designs encode within the layout and formatting signals that point to where the important information can be found. In this work, we propose a modeling approach that leverages these multi-modal signals to aid in the KPE task. In particular, we leverage both lexical and visual features (e.g., size, font, position) at the micro-level to enable effective strategy induction, and metalevel features that describe pages at a macrolevel to aid in strategy selection. Our evaluation demonstrates that a combination of effective strategy induction and strategy selection within this approach for the KPE task outperforms state-of-the-art models. A qualitative post-hoc analysis illustrates how these features function within the model.
Full Text Available
Quasi-one-dimensional metallic conduction channels in exotic ferroelectric topological defects

https://doi.org/10.1038/s41467-021-21521-9

Yang, Wenda; Tian, Guo; Zhang, Yang; Xue, Fei; Zheng, Dongfeng; Zhang, Luyong; Wang, Yadong; Chen, Chao; Fan, Zhen; Hou, Zhipeng; et al (December 2021, Nature Communications)
Ferroelectric topological objects provide a fertile ground for exploring emerging physical properties that could potentially be utilized in future nanoelectronic devices. Here, we demonstrate quasi-one-dimensional metallic high conduction channels associated with the topological cores of quadrant vortex domain and center domain (monopole-like) states confined in high quality BiFeO₃ nanoislands, abbreviated as the vortex core and the center core. We unveil via the phase-field simulation that the superfine metallic conduction channels along the center cores arise from the screening charge carriers confined at the core region, whereas the high conductance of vortex cores results from a field-induced twisted state. These conducting channels can be reversibly created and deleted by manipulating the two topological states via electric field, leading to an apparent electroresistance effect with an on/off ratio higher than 10³. These results open up the possibility of utilizing these functional one-dimensional topological objects in high-density nanoelectronic devices, e.g. nonvolatile memory.
Full Text Available
Local Matching Networks for Engineering Diagram Search

https://doi.org/10.1145/3308558.3313500

Dai, Zhuyun; Fan, Zhen; Rahman, Hafeezul; Callan, Jamie (May 2019, The World Wide Web Conference, WWW 2019)

Finding diagrams that contain a specific part or a similar part is important in many engineering tasks. In this search task, the query part is expected to match only a small region in a complex image. This paper investigates several local matching networks that explicitly model local region-to-region similarities. Deep convolutional neural networks extract local features and model local matching patterns. Spatial convolution is employed to cross-match local regions at different scale levels, addressing cases where the target part appears at a different scale, position, and/or angle. A gating network automatically learns region importance, removing noise from sparse areas and visual metadata in engineering diagrams. Experimental results show that local matching approaches are more effective for engineering diagram search than global matching approaches. Suppressing unimportant regions via the gating network enhances accuracy. Matching across different scales via spatial convolution substantially improves robustness to scale and rotation changes. A pipelined architecture efficiently searches a large collection of diagrams by using a simple local matching network to identify a small set of candidate images and a more sophisticated network with convolutional cross-scale matching to re-rank candidates.
Full Text Available
Optimizing Performance and Computing Resource Management of In-memory Big Data Analytics with Disaggregated Persistent Memory

https://doi.org/10.1109/CCGRID.2019.00012

Chen, Shouwei; Wang, Wensheng; Wu, Xueyang; Fan, Zhen; Huang, Kunwu; Zhuang, Peiyu; Li, Yue; Rodero, Ivan; Parashar, Manish; Weng, Dennis (May 2019, 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID))

The performance of modern Big Data frameworks, e.g. Spark, depends greatly on high-speed storage and shuffling, which impose a significant memory burden on production data centers. In many production situations, persistence- and shuffling-intensive applications can suffer a major performance loss due to lack of memory. Thus, the common practice is usually to over-allocate the memory assigned to the data workers for production applications, which in turn reduces overall resource utilization. One efficient way to address the dilemma between the performance and cost efficiency of Big Data applications is through data center computing resource disaggregation. This paper proposes and implements a system that incorporates the Spark Big Data framework with a novel in-memory distributed file system to achieve memory disaggregation for data persistence and shuffling. We address the challenge of optimizing performance at affordable cost by co-designing the proposed in-memory distributed file system with large-volume DIMM-based persistent memory (PMEM) and RDMA technology. The disaggregation design allows each part of the system to be scaled independently, which is particularly suitable for cloud deployments. The proposed system is evaluated in a production-level cluster using real enterprise-level Spark production applications. The results of an empirical evaluation show that the system can achieve up to a 3.5-fold performance improvement for shuffle-intensive applications with the same amount of memory, compared to the default Spark setup. Moreover, by leveraging PMEM, we demonstrate that our system can effectively increase the memory capacity of the computing cluster at affordable cost, with a reasonable execution time overhead with respect to using local DRAM only.
Full Text Available
